Skip to content

Conversation

@shawbyoung
Copy link
Contributor

@shawbyoung shawbyoung commented Jul 24, 2024

Extended function and block pseudo probe stale matching

Follow-up to #99891.

This change brings flexible profile recovery using pseudo probes:

  • using multiple profiles for one binary function, e.g. if the binary
    function inlines profile functions,
  • using parts of one profile across multiple binary functions, e.g. if
    the profile has more aggressive inlining compared to the input binary.

Prerequisites

  • YAML profile must be created with --profile-write-pseudo-probes
    perf2bolt flag.
  • Pseudo probe matching is enabled with
    --stale-matching-with-pseudo-probes which augments hash-based
    matching + profile inference enabled with --infer-stale-profile
    (required).
  • The binary must contain pseudo probes enabled with
    -fpseudo-probe-for-profiling Clang option.

Performance testing

  • Using two LTO+CSSPGO builds of Clang ([Clang][CMake] Add CSSPGO support to LLVM_BUILD_INSTRUMENTED #79942) from the same source
    with different profiles causing different inlining decisions,
  • Clang-BOLT testing harness https://github.com/aaupov/llvm-devmtg-2022
  • 2S Intel SKX Xeon 6138 with 40C/80T and 256GB RAM, using 20C/40T for
    build,
  • Bootstrapped build of Clang, wall time:
    • no BOLT: 301.87 +- 1.44 seconds, 0.91 IPC
    • stale profile: 295.242 +- 0.683 seconds, 0.93 IPC
    • hash matching: 289.035 +- 0.656 seconds, 0.95 IPC
    • probe matching: 283.558 +- 0.927 seconds, 0.96 IPC
    • fresh profile: 275.131 +- 0.444 seconds, 0.99 IPC

Depends on:

Co-authored-by: aaupov@fb.com

Created using spr 1.3.4
Created using spr 1.3.4
Created using spr 1.3.4
aaupov added a commit that referenced this pull request Sep 4, 2024
To be used for pseudo probe function matching (#100446).

Test Plan: updated pseudoprobe-decoding-inline.test

Pull Request: #107137
…BF, determine best matching binary inline subtrees, find corresponding binary functions, attach the profile to binary function without a profile and most probes

Created using spr 1.3.4
Created using spr 1.3.4
Created using spr 1.3.4
@github-actions
Copy link

github-actions bot commented Nov 8, 2024

✅ With the latest revision this PR passed the C/C++ code formatter.

Created using spr 1.3.4
aaupov added a commit that referenced this pull request Nov 12, 2024
Match inline trees first between profile and the binary: by GUID,
checksum, parent, and inline site for inlined functions. Map profile
probes to binary probes via matched inline tree nodes. Each binary probe
has an associated binary basic block. If all probes from one profile
basic block map to the same binary basic block, it’s an exact match,
otherwise the block is determined by majority vote and reported as loose
match.

Pseudo probe matching happens between exact hash matching and call/loose
matching.

Introduce ProbeMatchSpec - a mechanism to match probes belonging to
another binary function. For example, given functions foo and bar:
```
void foo() {
  bar();
}
```
profiled binary: bar is not inlined => have top-level function bar
new binary where the profile is applied to: bar is inlined into foo.

Currently, BOLT does 1:1 matching between profile functions and binary
functions based on the name. #100446 will extend this to N:M where
multiple profiles can be matched to one binary function (as in the
example above where binary function foo would use profiles for foo and
bar), and one profile can be matched to multiple binary functions (e.g.
if bar was inlined into multiple functions).

In this diff, ProbeMatchSpecs would only have one BinaryFunctionProfile
(existing name-based matching). 

Test Plan: Added match-blocks-with-pseudo-probes.test

Performance test:
- Setup:
  - Baseline no-BOLT: Clang with pseudo probes, ThinLTO + CSSPGO
  (#79942)
  - BOLT fresh: BOLTed Clang using fresh profile,
  - BOLT stale (hash): BOLTed Clang using stale profile (collected on
    Clang 10K commits back), `-infer-stale-profile` (hash+call block
    matching)
  - BOLT stale (+probe): BOLTed Clang using stale profile,
    `-infer-stale-profile` with `-stale-matching-with-pseudo-probes`
    (hash+call+pseudo probe block matching)
  - 2S Intel SKX Xeon 6138 with 40C/80T and 256GB RAM, using 20C/40T for
    build,
  - BOLT profiles are collected on Clang compiling large preprocessed
    C++ file.
- Benchmark: building Clang (average of 5 runs), see driver in
  aaupov/llvm-devmtg-2022
- Results, wall time, lower is better:
  - Baseline no-BOLT: 429.52 +- 2.61s,
  - BOLT stale (hash): 413.21 +- 2.19s,
  - BOLT stale (+probe): 409.69 +- 1.41s,
  - BOLT fresh: 384.50 +- 1.80s.

---------

Co-authored-by: Amir Ayupov <aaupov@fb.com>
Created using spr 1.3.4
Created using spr 1.3.4
Created using spr 1.3.4
Created using spr 1.3.4
@aaupov aaupov marked this pull request as ready for review October 29, 2025 06:25
aaupov added a commit that referenced this pull request Nov 6, 2025
Pseudo probe matching (#100446) needs callee information for call
probes.
Embed call probe information (probe id, inline tree node, indirect flag)
into CallSiteInfo. As a consequence:
- Remove call probes from PseudoProbeInfo to avoid duplication, making 
  it only contain block probes.
- Probe grouping across inline tree nodes becomes more potent + allows
  to unambiguously elide block id 1 (common case).

Block mask (blx) encoding becomes a low-ROI optimization and will be
replaced by a more compact encoding leveraging simplified
PseudoProbeInfo in #166680.

The size increase is ~3% for an XL profile (461->475MB). Compact block
probe encoding shrinks it by ~6%.

Test Plan: updated pseudoprobe-decoding-{inline,noinline}.test
aaupov added a commit that referenced this pull request Nov 11, 2025
Pseudo probe matching (#100446) needs callee information for call probes.
Embed call probe information (probe id, inline tree node, indirect flag)
into CallSiteInfo. As a consequence:
- Remove call probes from PseudoProbeInfo to avoid duplication, making
  it only contain block probes.
- Probe grouping across inline tree nodes becomes more potent + allows
  to unambiguously elide block id 1 (common case).

Block mask (blx) encoding becomes a low-ROI optimization and will be
replaced by a more compact encoding leveraging simplified PseudoProbeInfo
in #166680.

The size increase is ~3% for an XL profile (461->475MB). Compact block
probe encoding shrinks it by ~6%.

Test Plan: updated pseudoprobe-decoding-{inline,noinline}.test

Reviewers: paschalis-mpeis, ayermolo, yota9, yozhu, rafaelauler, maksfb

Reviewed By: rafaelauler

Pull Request: #165490
for (auto &[ParentId, InlineSiteCSI] : ParentToCSI) {
for (auto &[InlineSite, CSI] : InlineSiteCSI) {
auto [CalleeGUID, CallSite] = InlineSite;
errs() << ParentId << "@" << CallSite << "->"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

errs() --> dbgs()

++ProbeMatchingStats.MismatchingRootGUID;
return 0;
}
{
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why use the extra { }?

{
uint64_t BinaryHash = Decoder.getFuncDescForGUID(Root.Guid)->FuncHash;
uint64_t ProfileHash = FuncNode.Hash;
if (BinaryHash != ProfileHash) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is due to the stale profile, right?

++ProbeMatchingStats.MatchedRoots;

uint64_t Matched = 1;
for (const auto &[Idx, ProfileNode] :
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add some comments for this for loop?

return Root;
}

size_t YAMLProfileReader::matchInlineTreesImpl(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add some more high-level descriptions(like more details for what each step does and for different cases) for this function? The function is complex, and it would help future readers.

Also do you think we have good test coverage for this function? it looks it all reuses the existing tests, no test for the new feature? Though I understand it probably needs a large binary/profile, if that's the case, i'm also fine as long as you have tested locally.

BinaryFunction &BF, yaml::bolt::BinaryFunctionProfile &YamlBF,
const MCDecodedPseudoProbeInlineTree &Root, uint32_t RootIdx,
ArrayRef<yaml::bolt::InlineTreeNode> ProfileInlineTree,
MutableArrayRef<const MCDecodedPseudoProbeInlineTree *> Map, float Scale) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks scale is unused

}
}
if (opts::StaleMatchingWithPseudoProbes) {
SmallVector<StringRef, 0> Suffixes(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

0 -->5

Comment on lines +730 to +731
uint64_t BinaryHash = Decoder.getFuncDescForGUID(Root.Guid)->FuncHash;
uint64_t ProfileHash = FuncNode.Hash;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: constify.

ProfiledFunctions.emplace(&BF);
}

/// Return a top-level binary inline tree node for a given \p BF
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
/// Return a top-level binary inline tree node for a given \p BF
/// Return a top-level binary inline tree node for a given \p BF.

BC.Stats.ExactMatchedSampleCount, BC.Stats.StaleSampleCount);
BC.outs() << format(
"BOLT-INFO: inference found an exact pseudo probe match for %.2f%% of "
"BOLT-INFO: inference found pseudo probe match for %.2f%% of "
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want to print this info if pseudo probes are not present?

HashFunction HashFunction, YAMLProfileReader::ProfileLookupMap &IdToYamlBF,
const BinaryFunction &BF,
const ArrayRef<YAMLProfileReader::ProbeMatchSpec> ProbeMatchSpecs);
size_t matchWeights(BinaryContext &BC,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BinaryContext can be obtained from BinaryFunction.

BlendedBlockHash BinHash = BlendedHashes[MatchedBlock->Index - 1];
LLVM_DEBUG(dbgs() << "Matched yaml block (bid = " << YamlBBIdx << ")"
<< " with hash " << Twine::utohexstr(YamlBB->Hash)
LLVM_DEBUG(dbgs() << "Matched yaml block (bid = " << YamlBB.Index << ")"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
LLVM_DEBUG(dbgs() << "Matched yaml block (bid = " << YamlBB.Index << ")"
LLVM_DEBUG(dbgs() << "BOLT-DEBUG: matched yaml block (bid = " << YamlBB.Index << ")"

}
const BinaryFunctionProfile *Callee = IdToYamLBF.lookup(CSI.DestId);
if (!Callee) {
LLVM_DEBUG(dbgs() << "no callee for " << CSI.DestId << " " << CSI.Count
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
LLVM_DEBUG(dbgs() << "no callee for " << CSI.DestId << " " << CSI.Count
LLVM_DEBUG(dbgs() << "BOLT-DEBUG: no callee for " << CSI.DestId << " " << CSI.Count

}
// Get callee GUID
if (Callee->InlineTree.empty()) {
LLVM_DEBUG(dbgs() << "no inline tree for " << Callee->Name << '\n');
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
LLVM_DEBUG(dbgs() << "no inline tree for " << Callee->Name << '\n');
LLVM_DEBUG(dbgs() << "BOLT-DEBUG: no inline tree for " << Callee->Name << '\n');

});

assert(!Root.isRoot());
LLVM_DEBUG(dbgs() << "matchInlineTreesImpl for " << BF << "@"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
LLVM_DEBUG(dbgs() << "matchInlineTreesImpl for " << BF << "@"
LLVM_DEBUG(dbgs() << "BOLT-DEBUG: matchInlineTreesImpl for " << BF << "@"


// Match profile function with a lead node (top-level function or inlinee)
if (Root.Guid != FuncNode.GUID) {
LLVM_DEBUG(dbgs() << "Mismatching root GUID\n");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
LLVM_DEBUG(dbgs() << "Mismatching root GUID\n");
LLVM_DEBUG(dbgs() << "BOLT-DEBUG: mismatching root GUID\n");

uint64_t BinaryHash = Decoder.getFuncDescForGUID(Root.Guid)->FuncHash;
uint64_t ProfileHash = FuncNode.Hash;
if (BinaryHash != ProfileHash) {
LLVM_DEBUG(dbgs() << "Mismatching hashes: "
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
LLVM_DEBUG(dbgs() << "Mismatching hashes: "
LLVM_DEBUG(dbgs() << "BOLT-DEBUG: mismatching hashes: "

and more cases below.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants